Compression of Semistructured Documents
نویسندگان
چکیده
EGOTHOR is a search engine that indexes the Web and allows us to search the Web documents. Its hit list contains URL and title of the hits, and also some snippet which tries to shortly show a match. The snippet can be almost always assembled by an algorithm that has a full knowledge of the original document (mostly HTML page). It implies that the search engine is required to store the full text of the documents as a part of the index. Such a requirement leads us to pick up an appropriate compression algorithm which would reduce the space demand. One of the solutions could be to use common compression methods, for instance gzip or bzip2, but it might be preferable if we develop a new method which would take advantage of the document structure, or rather, the textual character of the documents. There already exist a special compression text algorithms and methods for a compression of XML documents. The aim of this paper is an integration of the two approaches to achieve an optimal level of the compression ratio. Keywords— Compression, search engine, HTML, XML.
منابع مشابه
Semantic Lossy Compression of XML Data
In the last years a large amount of semistructured data [1, 10] has been managed and exchanged. The largest repository of semistructured data is the World Wide Web, which can be thought of as an enormous database in which data is highly heterogeneous and freely correlated. In this scenario is placed Extensible Markup Language (XML) [14], a language for semistructured data standardised by the Wo...
متن کاملSemistructured Data Store Mapping with XML and Its Reconstruction
XML has been quickly emerging as a dominant standard for data representation and exchange on the World Wide Web for its many good features such as well-formed structure or semantic support. Research on semistructured data over the last several years has focused on data models, query languages, and systems where the database is modeled in some form of a labeled, directed graph. Processing this a...
متن کاملUsing structural contexts to compress semistructured text collections
We describe a compression model for semistructured documents, called Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., different XML tag). The intuition behind SCM is that the distribution of all the texts t...
متن کاملInstance-Independent View Serializability for Semistructured Databases
Semistructured databases require tailor-made concurrency control mechanisms since traditional solutions for the relational model have been shown to be inadequate. Such mechanisms need to take full advantage of the hierarchical structure of semistructured data, for instance allowing concurrent updates of subtrees of, or even individual elements in, XML documents. We present an approach for concu...
متن کاملDiscovery of Frequent Tag Tree Patterns in Semistructured Web Documents
Many Web documents such as HTML files and XML files have no rigid structure and are called semistructured data. In general, such semistructured Web documents are represented by rooted trees with ordered children. We propose a new method for discovering frequent tree structured patterns in semistructured Web documents by using a tag tree pattern as a hypothesis. A tag tree pattern is an edge lab...
متن کامل